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Abstract 

Vertex direction algorithms have been around for a few decades in the experimental 
design and mixture models literature. We briefly review this type of algorithms 
and describe a new member of the family: the support reduction algorithm. The 
support reduction algorithm is applied to the problem of computing nonparametric 
estimates in two inverse problems: convex density estimation and the Gaussian 
deconvolution problem. Usually, VD algorithms solve a finite dimensional (version 
of the) optimization problem of interest. We introduce a method to solve the true 
infinite dimensional optimization problem. 

1 Introduction 

During the past decades emphasis in statistics has shifted from the study of parametric 
models to that of semi- or nonparametric models. A big advantage of these latter models 
is their flexibility and ability to 'let the data speak for itself. However, also problems that 
were not usually crucial in the parametric case, turn out to be difficult in the semiparametric 
situation. The asymptotic distribution theory of estimators is one of these problems. The 
multivariate central limit theorem and the delta method give the answer to many questions 
regarding asymptotic distribution theory in the parametric setting. For the semiparametric 
situation, such 'basic tools' are not available. Another problem that is usually easier to 
solve in parametric models is the problem of computing M-estimators that are defined 
as minimizer of a random criterion function. In a parametric model often estimates 
can be computed explicitly or computed using some numerical technique for solving (low 
dimensional) convex unconstrained optimization problems like steepest descent or Newton. 
In semiparametric models, the computational issues often boil down to high dimensional 
constrained optimization problems. 

Apart from algorithms that are known from the general theory of optimization, 
algorithms have been designed within the field of statistics that are particularly useful 
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in certain statistical applications. Perhaps the best known example of this type is 
the Expectation Maximization (EM) algorithm of Dempster, Laird and Rubin (1977)] 
that is designed to compute maximum likelihood estimates based on incomplete data. 
Another example is the iterative convex minorant algorithm that is introduced in 



Groeneboom and Wellner (1992) and further studied in Jongbloed (1998) That 
algorithm is based on techniques known from the theory of isotonic regression as can be found 
in [Robertson, Wright, and Dykstra (1988)1 and can be used to compute nonparametric 



estimators of distribution functions in semiparametric models. Another class of algorithms 
that falls within this framework is the class of vertex direction (VD) algorithms. 

In section |21 we introduce the general structure of VD algorithms and mixture models 
where VD algorithms can be used to compute nonparametric function estimates. Two specific 
examples of these mixture models will be considered in subsequent sections: estimating a 
convex decreasing density and estimating a mixture of unit variance normal distributions. 

In section El we introduce the support reduction algorithm as a specific member of the 
VD family of algorithms. This algorithm essentially replaces the original infinite dimensional 
constrained optimization problem by a sequence of finite dimensional unconstrained 
optimization problems. The algorithm is designed to keep the dimension of these sub- 
problems as low as possible. For a specific type of statistical models, the algorithm seems to 
be a good candidate to compute sensible estimators. These are problems that are difficult 
from the asymptotic statistical point of view in the sense that the convergence rate of the 
estimator is relatively low. 

All VD algorithms have to deal with a problem of minimizing a "directional derivative" 
function over some set of parameters. There are some variants of these functions. For 
quadratic objective functions, we will describe an alternative directional derivative function 
in section 13 that takes more local information of the objective function into account. 

The directional derivative function (and our alternative) are usually nonconvex functions 
on a continuum of parameters. Usually the associated nonconvex minimization problem 
is circumvented by considering a fine grid within the parameter space and minimizing the 
function only over that grid. In section El we propose a method of "leaving the grid" , 
tackling the infinite dimensional optimization problem rather than the finite dimensional 
approximation. 

Section El is devoted to least squares estimation within a mixture model. The general 
procedure is given and for the problem of estimating a convex and decreasing density 
based on a sample from it, will be considered in detail. In that situation the support 
reduction algorithm boils down to what is called the iterative cubic spline algorithm in 



Groeneboom, Jongbloed and Wellner (2001b) 



In section [71 the general problem of computing a maximum likelihood estimate within 
a mixture model will be addressed. A Newton procedure based on the support reduction 
algorithm will be described. The normal deconvolution problem will serve as example to 
illustrate the general approach. 
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2 Vertex direction-type algorithms 

Consider the following type of optimization problem 

minimize 0(/) for f E C (2-1) 

where is a convex function defined on (a superset of) a convex set of functions C. We 
assume throughout that has a unique minimizer over C. 

Assumption Al: is a convex function on C such that for each f,gEC where is finite, 
the function t (f){f + t{g — /)) is continuously differentiable for t G (0, 1). 

Now define, for each f & C and h a function such that for some e > 0, / + e/i G C, 

D^{h-f) = \\me~' {^{f + eh) - <j>{f)) 

Note that this quantity exists (possibly equal to oo) by convexity of 0. As we will see, a 
choice often made for h is h = g — f for some arbitrary g & C . In that case we have 

D^{g -f;f) = lime-i (0(/ + e{g - /)) - 0(/)) 

The following simple but important result gives necessary and sufficient conditions for / 
to be the solution of ()2.1|) . 

Lemma 2.1 Suppose that satisfies Al. Then 

f = argminj^c'PU) if (^^d only if D^{g - f; /) > for all g eC . 

Proof: First we prove Suppose / = argminjg(^0(/) and choose g E C arbitrarily. Then, 
for e i 

< e-^(0(/ + e{g - /)) - 0(/)) i D^{g - /; /) 

Now <^=. For arbitrary g & C, write r for the convex function e ^— > 0(/ + e{g — /)) and note 
that 

m - = - r(0) > r'(0+) = D^ig - f; /) > . 

□ 

Consider now the case where C is the convex hull of a class of functions 

J^={fg : eeecR"}, (2.2) 

in the sense that 

C = conv(jF) = i^g = J fgdniO) : probability measure on oj> . (2.3) 

Here are two examples of mixture models that fall within this framework. These examples 
will reappear in subsequent sections. 
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Example 1. (convex decreasing density) 

The class of convex decreasing densities on [0, oo) has representation (|2.3|) with 



fe{x)= ' ' imix), e>o. 

It is obvious that any (positive) mixture of these functions is convex and decreasing. Since 
the mixing measure is a probability measure, it also follows that the mixture is a probability 
density. To see that any convex and decreasing density can be written as mixture of /g's, 
note that the measure defined by d^{9) = ^9'^df'{9) gives 



fe{x) dM = J ' ' dM = J (e-x) df\e) = fix) . 

Situations where the problem of estimating a convex and decreasing density based 



on a sample from it is encountered, can e.g. be found in Hampel (1987) and 



Lavee, Safrie, and Meilijson (1991)] □ 



Example 2. (mixture of unit variance normals) 

The Gaussian deconvolution problem as considered in e.g. 
Groeneboom and Wellner (1992)[ entails estimation of a density (and associated 



mixing distribution) that belongs to the convex hull of the class of normal densities with 
unit variance: 

fe{x) = ^e-i(-^)^ . 

□ 

In the examples just considered, usually one has a sample Xi,X2,...,X„ from the 
unknown density / G C, and wants to estimate the underlying density / based on that 
sample. In this paper we consider two types of nonparametric estimators: least squares (LS) 
estimators and maximum likelihood (ML) estimators. 

Least Squares estimation. 

We define a least squares estimate of the density in C as minimizer of the function 

over the class C. Here F„ is the empirical distribution function of the sample. 

The reason for calling this estimator a LS estimator, is the following heuristic. For any 
(arbitrary) square integrable density estimate /„ of /o, one can define the LS estimate as 
minimizer of the function 

2 



f^T^ I (fit) - my dt = oj^ fity dt - /w/nW dt + (2.5) 



over the class C. It is seen that, as far as minimization over / is concerned, fl2.5|) only 
depends on the density /„ via its distribution function. The objective function in ()2.4p is 
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obtained by taking the empirical distribution function for this estimator, so we take formally 
fn{t) dt = d¥n{t) in (j231)- Note that for objective function 

D^h; /) = lime-^ + eh) - <j){f)) = j h{x)f{x) dx - j h{x) rfF„(x) . 
Maximum Likelihood estimation. 

As maximum likelihood estimate we define the minimizer of the function 



0(/) = - J \ogf{x)d¥4x) 
over the class of densities C. Note that for this function 

D^h; /)=lime-i + e/,) - 0(/)) = - J j^^d¥^{x). 

For both objective functions 0, the function has the linearity property stated below. 

Assumption A2: the function (p has the property that for each f E C and g = 
J^fedfigie) e C, 

D^ig -f-J)= [ D^ife - /; /) df^gie) . (2.6) 
Je 

Under this additional assumption, the nonnegativity condition in lemma ITT] that has to hold 
for each g E C, may be restricted to functions g E 

Lemma 2.2 Suppose that C =conv{J-') with T as in \2.^) and that satisfies Al and A2. 
Then 

f = argminf^fj(t>if) ".^d only if D^{fe - f;f)>0 for all 9 eQ . 
Proof: Follows immediately from lemma ITT] the fact that fe^C and ()2.6|1 □ 

For the situation of Lemma 12. 2^ there is a variety of algorithms to solve 1)2.11) that can 
be called 'of vertex direction (VD) type'. A common feature of VD algorithms is that they 
consist of two basic steps. Given a current iterate /, find a value of 6 such that D^{fg — /; /) 
is negative. (If such a value cannot be found, the current iterate is optimal!) This means 
that travelling from the current iterate in the direction of fe would (initially) decrease the 
value of the function 0. 

Having found such a feasible profitable direction from the current iterate, the next step 
is to solve some low-dimensional optimization problem to get to the next iterate. 



The original algorithm, proposed by Fedorov (1972) and Wynn (1970) in the context 



of computing an optimal design, as well as the algorithm proposed by Simar (1976)| (for 
computing the maximum likelihood estimate of the mixing distribution in a Poisson mixture) 
that we will come back to later, implement the first step as follows. Given the current /, 
find 6 corresponding to the minimizer of D^{fg — /; /) over 0. 



Fedorov (1972) and Wynn (1970)| then propose to take as new iterate the function 

e)/ + efs 
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where e is given by 

e = argmin^gp - e)/ + e/^) . 

In words, the next iterate is the optimal convex combination of the current iterate and the 
most promising vertex in terms of the directional derivative. It is clear that usually the next 
iterate has one more support point than the current iterate. 

The vertex exchange algorithm as proposed in [Bohning (1986)] not only uses the 
parameter 6 corresponding to the minimizer of D^{fQ — f;f), but also the maximizer 6 
of D^{fg — f; f) restricted to the support points of the current iterate to get the direction. 
Denote by the mass assigned by the mixing distribution corresponding to / to 6. 

Then the direction given by the algorithm is / + f^f{{6}){f§ — fg)- The new iterate becomes 

f + efifimife - fe) 

where 

e = argmin,g[o + efif{{9}){fg - fg)) . 

If e = 1, the point 6 is eliminated from the support of the current iterate, and the mass 
assigned to 6 by the 'old' mixing distribution, is moved to the new point 6. It is clear that 
in this algorithm the number of support points of the iterate can increase by one, remain 
the same, but also decrease by one during one iteration (if e = 1 and 6 already belongs to 
the support). In specific examples, the number of support points of the solution / is known 
to be smaller than a constant N which only depends on the data (and is known in advance). 
In the context of random coefficient regression models. Mallet (1986) [ proposes to restrict 
all iterates to having at most support points. 



Another variation on the theme is due to Lesperance and Kalbfleisch (1992)] It is 



called the intra simplex direction method. The set of all local minima {^i, . . . , 6^} of D^{fg ■ 
/; /), where is negative, is determined and the optimal convex combination of the current 
iterate and all vertices fg-^,..., fg^ is the new iterate. This final step is to minimize a convex 
function in the variables ei. . . . , under the constraint < Y^^i ^ 1- 

The aforementioned algorithm proposed by [Simar (1976)| and further studied in 
Bohning (1982)( sticks to the original idea of picking one 6 corresponding to a profitable 
direction. The second step differs from those indicated above. Denote by Sf the set of 
support points of the mixing measure corresponding to a function f E C. Then, given 6, the 
next iterate is given by 

g = argmin;,gc'(/)</'(^)' where C{f) = {h e C : ShC SfU {§}} . 

It is to be noted that support points can (and usually do) vanish during this second step. 
Under certain conditions, Bohning ( 1 982 )| proves convergence of this algorithm. 



In section |3] we revisit Simar's algorithm and propose an extension of it that can deal 
with the case where C is the convex cone rather than convex hull generated by J-". This is 
convenient for the examples we consider. Moreover, we will introduce an algorithm that is 
closely related to Simar's algorithm: the support reduction algorithm. 
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3 Support reduction and Simar's algorithm 

In Simar's original algorithm, two optimization problems have to be solved. The first is to 
minimize the (usually nonconvex) function D^{fQ — f; f) in 6. The second is to minimize 
over the convex set of functions that is generated by finitely many functions from JF. In many 
examples (including the examples considered here), this second step gets more tractable if 
we were allowed to minimize over the convex cone generated by these finitely many functions 
in JF. In this section we therefore consider our function class JF and the convex cone C 
generated by it: 

C = cone(jF) = = J fgdfi{9) : fi positive finite measure on 6 

As will be seen in section [B] and [7[ our two examples fit within this framework of 
minimizing over the convex cone generated by a set of functions. Assumption A2 is 
now replaced by the following. 

Assumption A2': the function (p has the property that for each f E C = cone(^) and 
9 = Jefe d^g{6) G C, 

D^{g-f) = j D^{fe-J)dfig{e). (3.7) 

Remark. Suppose that hi and /i2 are such that for a small positive e, / + ehi G C for 
i = 1,2. Then, since C is convex, we have that / + ^e{hi + /i2) G C, and D^{-; f) is well 
defined at hi, h2 and /ii + /12. Assumption A2' then implies the following linearity property: 

D^ihi+h^; f) = j D^ifg; f) dfi^.+hM = j D^{fg; f) d (/x,, + (9) = D^^hi, f)+D^{h2; f) 

(3.8) 

Remark. Assumption A2' implies A2 for g G conv(jF). Indeed, take g = fe d^g{9) G 
conv(jF), meaning that [ig is a probability measure. Then we have, also using (j3.8|) . 

D^{g-f-J) = D^ig;f)-D^{f;f) = J D^{UJ)dfigi9)-D^{f;f) 

D^ife; f) - D^if; f) rf/i,(0) = [ D^{fg - /; /) diig{9) . 



Let us formulate a result for a generated cone analogous to lemma 

Lemma 3.1 Let C = cone(jF) and cf) satisfy Al and A2'. Suppose that the measure /i^ in 
f = fe fe dfij{9) has finite support. Then 

> for all 9 eQ 



f = argmzn^,am zf and only zf D.ife; f) \ ^ ^ for all 9 e supp{^, f) . ^^.9) 



Proof: If / = argminjg(^0(/), then we have by Al that 

D^U; f) = Ihne-i (^((1 + e)f) - 0(/)) = . 
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Hence, by fl3.8|) and lemma 12. H we have for all ^ G 

D^ifg; f) = D^ifg - / + /; /) = D^ife - /; /) + D^if; f) = D^{fe - /; /) > (3.10) 
In view of property ()3.7|) . we have 

= D^(/;/) = j D^ife;f)diXfie). 

In the presence of the inequalities in ()3.10|) we therefore have that D^^fe; /) = on the 
support of fij necessarily. 

Conversely, if / satisfies the (in)equalities given in ()3.9|) above, we have for any f (z C 
that 

- m > D^if - f; /) = D^if; /) = J D^{fg; /) dfif{e) > . 

□ 

Remark. The assumption that the support of /i^ is finite seems to be restrictive and 
unnatural. However, there are many problems (including our examples) where this is true. 
Of course, if G is finite it is trivially true (this e.g. covers interval censoring problems). 
Moreover, maximum likelihood estimators in mixture models usually have this property 



( Lindsay (1995) theorem 18, section 5.2). 

Below we give the pseudo code for Simar's algorithm constructed for a cone and also for 
the support reduction algorithm we propose. In fact, as will be seen below, the support 
reduction algorithm is Simar's algorithm where one substep is not completely followed till 
the end. 



Basic Simar- and support reduction algorithm for a cone 



Input: 

T) > 0: accuracy parameter; 
^(0) g Q. starting value; 

begin 

while min^ee D^{fe; f) < -1] do 
begin 

e := argmingge^9i(/e; /); 
S* := Sf U {§}■, 

f := argmiiig^c :S,cs*<P{9y, (Simar) 
f := aigmmg^c:SgCsS-(l^i9); (Support reduction) 
end; 
end. 



The meaning of 'Cs' will become clear in the sequel. For both algorithms, there are two finite 
dimensional optimizations that have to be performed. The first one is over 0. In general 
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the function 9 ^— D^{fQ] f) is nonconvex and minimizing such a function is usually difficult. 
Hence, in each setting one should try to take advantage of the specific features of that 
problem to attack this first optimization problem. Usually one can restrict the minimization 
to a bounded subset of 9 and use a fine (finite) grid in this subset instead of the whole 
set B. Then the minimization reduces to finding the minimal element in a (long) vector. 
After that, it is possible to 'leave the grid' in a way as described in sectional Sometimes 
(e.g. when computing the ML estimator of a distribution function based on interval censored 
observations) it is even possible to select a finite subset of 9, based on the data, such that 
the minimizer of over C is contained in the convex hull of the corresponding finitely many 
generators. In subsequent sections, we will address this matter more specifically in the 
examples. 

The second optimization in the algorithm is over a convex cone that is spanned by finitely 
many functions fg in JF. Lemma ITT] gives a characterization of such a function (applying the 
lemma to the finite subset S* of O instead of B itself). We propose the following general way 
of solving this finite dimensional constrained optimization problem in Simar's algorithm. In 
passing it will become clear what the support reduction algorithm does. 

Given the current iterate and the new support point 9, consider the linear space L spanned 
by the finitely many functions {fo : 6* G S*}: 

Ls* = = j fo da{9) : cr is a finite signed measure on S"* j> , 

and determine 

= argmin,,^^,0(^?) = j foda^^ ^M = E fo^f..s*m) ■ 

We assume has a smooth convex extension to the space Ls*- In our examples and many 
others this is certainly the case. This optimization corresponds to finding a solution of a 
finite system of equations. Of course, /i'^'* will in general not be an element of C, since 
certain coefficients o'ao)({9}) may be negative. Nevertheless we can always move from / 

towards fu^ and stay within the class C initially. This is a consequence of the fact that the 
coefficient cr^(o){{9}) of in fu^ will be strictly positive. Indeed, 

> hme-i (0(/ + e/f ) - 0(/)) = J D^{fg; f) da^^o^ = cx^,4{9})D^{U, f) 

and D^{ff, /) < by choice of 9. If fu^ G C then take this as next iterate. Otherwise define 
A = max{/ G (0, 1] : f+l{fjfi'>-f)eC}= min (1-a.o) ( W)/^^ W))-^ (3.11) 

and take as next iterate the function / + i{fu^ — /) and delete the support point 9 G S* 
where the minimum in the expression on the right hand side of 1)3.111) is attained from the 
support set: 
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Then compute the next unrestricted minimizer 

fu^ = argmin^g^^^^^^(/)(^) . 

If this function differs from the current iterate, again a step of positive length can be made 
in this direction, since for all 9 e S*^^\ aji) {{9}) > 0. If we can go all the way to f^l*, 

stop the iteration, and else delete the support point as it was done in the first step. This 
deletion of support points can be continued until we get a subset S*^^ C S* and a function 
f^^l* e C with support set S*^^^ such that 

^<^(/e;/S*) = Oforall 9eS*^^K 

The specific set S*^^^ obtained in this way as subset of S* is denoted by 'Cs', and this gives 
the next iterate in the support reduction algorithm. Note that the function (p is decreased 
all the way during the iterations of this substep. 

For Simar's algorithm, one should check for the points in S* \ S*^^^ whether the value 
can be improved upon by adding such points to the current support. The natural thing 
to do then is to take the value of 6' G S*^^^ where D^{fg; /) is minimal and add this to the 
support. In the support reduction algorithm we skip the adding of deleted points from S* 
and allow the next support point to be chosen without restriction from the whole set ©. 

Let us summarize the steps sketched above to determine / := argmin^g,^. 5^^^5*0(51) in 
pseudo code. 



Support reduction step 



Input: 

= f £ C: minimizer of over subset of C consisting of functions with same support Sf; 
5*(o) ^ S* ^ SfU {§}: finite set of support points; 
j := 0; 

begin 

while fi^^ ^ C do 
begin 

e^{9eS* : a.u-i){{9}) < and (TM-i)({9})/afU-i){{9}) is minimal}; 
A = (1 - fT.o-i)({^})/a^u-i)({^}))-i for some 9 e O; 

/0) = /o-i) + A(/;(^-^)-/o-i)); 

fn^ = argminggi^^^.^0(5); 
end; 

/ := fu"^ e C: minimizer of (p over subset of C consisting of 

functions with same support Sf = S*^^^ Cg S*; 

end. 



10 



We now see that the basic building stone of the algorithm is an unrestricted minimization of 
the function 0. As we will see in the sections E] and |7l there are efficient algorithms to solve 
this kind of optimization problems in specific situations. 

Before applying the algorithm to concrete problems, let us consider the convergence 



issue. The theorem below (the proof of which is inspired by Bohning (1982)1 states that 
the algorithms considered in this section indeed converge to the solution of the optimization 
problem. To get this, we need one additional condition on the function (p. This condition 
is needed to guarantee that a strictly negative value of D^^fo] f) for some 9 means that the 
next iterate will have some minimal decrease in 0- value. 

Assumption A3: For any specific starting function f^^^ G C with (f){f^^^) < oo, there exists 
an e G (0, 1] such that for all / G C with 0(/) < 0(/^°'') and 6' G O, the following implication 
holds: 

D^ife - /; /) < -5 < ^ 0(/ + e{fe - /)) - 0(/) < -h6 for all e G (0, e] 



We will see that this assumption holds for the problems we will address in subsequent 
sections. 



Theorem 3.1 Denote by fn a sequence generated by one of the algorithms introduced here. 
Then, under the assumptions Al, A2' and A3 we have that (f){fn) i <P{f) as n oo. 

Proof: Since we have for each n that 

fn argmin^gc: Sf=Sf„<P{f)^ 
we have by assumption Al that D^{fn] fn) = 0. Hence, by (|3.8p . we have for all n > 

D4,{fe - fn, fn) = D^ife; fn) for all deO. 

Since 4>{fn) is a bounded and decreasing sequence of real numbers, it decreases to a limit. 
Assume for the moment that 0(/n) I (p* = 4>{f) + S > (j){f) for some 6 > 0. We will extract 
a contradiction. 

Take 9n such that D^{f0,^; fn) < linf^ge D^{fg; fn). Then we get 

DMer. - fn; fn) = D^ifg,^; fn) < ^ mi D^{fe; fn) < [ ^-D^ife; fn) dfifiO) 

= ^D^if - /„; fn) < i (0(/) - 0(/n)) < I (0(/) - 0*) = -5X2.12) 

Again by monotonicity of (^{fn), we have that 0(/„) < 4>{f^^^) for all n, and assumption A3 
gives 

0(/n+i) < 0(/n + e(/e„ - /„)) < 0(/„) - leS for all n. (3.13) 
This contradicts the fact that 0(/n) converges. □ 
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In view of the convergence proof, there are some adaptations of the algorithms that do not 
destroy the convergence property of the algorithm. The first adaptation has to do with the 
choice of the most promising vertex. If the function on B is replaced by a function 

where w is some strictly positive weight function on O such that 

< w < w{6) <w<oo for all 6 e <d . 
Equation (|3.12p would then change to 

- fn, fn) = D^ife^; fn) < (w/w) inf /„) < -{w/w)6 , 

and the argument goes through with S replaced by {w/w)6. Similarly, A3 will also hold for 
if it holds for D^. In section 0] we will use this idea to define an alternative directional 
derivative function. 

The second adaptation is the following. If it is possible after reduction by deletion of 
support points to do an extra step of reduction by replacing two support points by a third or 
move a support point slightly without increasing the function 0, this will not prevent the 
algorithm from converging. This immediately follows from (|3.13|) . Indeed, if we replace the 
iterate /„ that would be obtained by the original method by /„ which satisfies 

fn = argmin^g^^ ^ ^ 0(/) and < 0(/„), 

the inequality (j3.13p also holds for /„+i instead of /„+i and the proof goes through. This 
adaptation of the algorithm will be discussed more elaborately in sectional 



4 Alternative directional derivative 

Consider a quadratic objective function 0^ on C = cone({/e : 9 E 0}). The objective 
function in the LS estimation context is quadratic automatically and in section [7| we will 
use a Newton algorithm to solve the ML estimation problem. In that algorithm a quadratic 
approximation of the objective function is minimized during each iteration. 

The function 0g is quadratic in /. Hence, along line segments in the linear space spanned 
by the functions fei,---fep, the function is also quadratic as a function of one variable. 
Along such segments (or lines), the function (pg can therefore be minimized explicitly. Given 
a 'current iterate' g in the algorithm, we consider for each 6 E Q the following function 
(alternative choice is to take fe — g instead of /g): 

e ^ <Pqi9 + e/e) - Md) = Ci(^, g)e + ^e^C2(^) . 



2 



Typically, C2 > 0, so that 

e = eg = argmin^0g(5( + efg 



ci{9,g) 
' C2{e) 
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is the optimal move along the line connecting g and g + fe- 

In order to have descent direction, we only consider points 6 where Ci{6,g) < 0. In that 
case, e > 0. As new vertex, we then define 

a ■ A t > ■ ci{e,gf . ci{e,g) 

e = argmm0ge:ci(e)<o09(^ + = argmmgeQ, (e)<o - ^ fa\ = argmmggg . 

5 A 'gridless' implementation 

For a practical implementation of the step of selecting a new support point, we propose to 
fix a fine grid Qs in © and run the whole algorithm with B5 instead of B. Having a precise 
approximation of the minimizer fgrid of over this finite dimensional cone, one can make the 
algorithm 'gridless' by fine tuning in the support points. This can be done by augmenting a 
step at each iteration in the spirit of the second remark after theorem 13.11 

Write / for the current iterate (at the first 'fine tuning step', this is fgrid) and define 

r(/ii, /i2, • • • , hp) = r(/ii, /i2, • • • , hp] f) = (p oiife^+ii^j - <P i^Yl "^•^^ 

with ai, . . . , Op fixed and h = [hi, . . . , hp)'^ varying over a neighborhood of zero in M^. The 
function r represents the value of the objective function if the masses are fixed and the 
current support points are shifted a bit. Abusing notation slightly, write 



di+hi 

i=l 



and note that for the least squares objective function (under mild smoothness assumptions 
on the parameterization of fe) 



dr 

dh 



— {hi,h2,...,hp)=ai I fe^+hXx)fh{x)dx - ai / /e^+/,^(x) c?F„(x) (5.14) 



and for the maximum likelihood objective function 

^(/ii, /i2, . . . , hp) = -a, j iMh^dW^x) . (5.15) 
In particular, note that 

^(0)=a4^^(^'/)l^-^» 

for both objective functions. Hence, the partial derivatives of r at zero are visualized in the 
graph of 6^ I— >• D^{fe; f) for both objective functions. Qualitatively, the interpretation of the 
partial derivatives of r is that if ^t(O) < 0, shifting the support point 9i slightly to the 
right will result in a decrease of the objective function. For the moment, fix /i G with 
II /i II 2 = 1 and consider the function 



/Xfe(e) = r(e/i) 
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on an interval [0, eo] for some small eg > 0. Note that Th{0) = 0. Then (again under mild 
smoothness assumptions) the derivative of fih is given by 

where VT{eh) is given either by (j5.14j) or (|5.15p . depending on the objective function. Taking 
e = 0, we see that the 'most promising' direction to move, is the direction — Vr(0), the 
direction of steepest descent. From now on take this direction. The aim is now to move the 
support points in this direction to get a sufficient decrease in the objective function. This 
means that fih is to be minimized as a function of e e (0, eo), or at least a value e has to be 
determined such that yU/i(e) is negative. Note that the function fih is nonconvex in general. 
We determine the step length by the method of regula falsi on the derivative fih- At zero this 
function is zero. Define = and = eo. If fJ'hi^u) < then take this e = eo. Otherwise 
proceed as follows. 

If /U^(e„) > define eu = e„ whereas if fi'^i^n) < define ei = e„. This process can be iterated 
till n'h^en) is sufficiently small in absolute value. This regula falsi method comes up with a 
stationary point of Hh- If the fihi^n) is positive, the line search procedure should be repeated 
with eo = ce„ for some < c < 1 (usually close to one). In our experience this step is hardly 
ever necessary, but conceptually it is needed. The procedure will (in case e 7^ eo) lead to a 
stationary point of fih with /^^(e) < /i/i(0) = 0. Actually, e will usually correspond to the 
smallest local minimum of fit- 
Next, define 

p 

/ := ^ aifg^+^h, ■ 

i=l 

The new iterate / is finally obtained by minimizing over the cone generated by {/g^+^/j. : 
1 < i < p}. This function satisfies the conditions needed at the beginning of the just 
described 'fine tuning' step. Hence, it can be iterated till the norm of /i'/i(0) is sufficiently 
small. 

6 LS estimation of a convex density 

In this section we study the problem of computing the least squares estimate of a convex 
and decreasing density on [0, cx)). In [Groeneboom, Jongbloed and Wellner (2001b)[ it 
is shown that the (uniquely defined) minimizer of the convex function over conv(^) is the 
same as the minimizer of over cone(JF). It is also shown that there only functions fg with 
6 G [a;i,i^'] for some K < 00 have to be considered in the optimization, since the optimal 
function has no change of slope at a location to the left of Xi and has compact support. 
Hence we are in the situation of section ^ Moreover, we have 

feix)f{x) dx - feix) d¥^{x) = - {H{9; /) - r„(0)) 
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where 



H{e-J)= ! [ f{y)dydxa.ndYr,{9)= f¥n{x)dx: 

Jx=0 Jo Jo 



here we use the same notation as in |GROENEBOOM, Jongbloed and Wellner (2001b) Note 
that the assumptions Al, A2' and A3 are satisfied in this situation. For A3 note that 

POO 

0(/ + eife - /)) = m + eD^ife -/;/) + / ifo{x) - f{x)f dx (6.16) 

Jo 

and that for 6 E [xi, K] 

{fe{x) - f{x)f dx = -- -H{e- f) + f{xf dx<M 

for some big finite M not depending on 6. 

Let us now consider the support reduction algorithm. To start this algorithm, we fix a 
starting value 9^'^^ > Xn- Then we determine the function cfg^ minimizing (f) as function of 
c > 0. To this end we need the value c that minimizes 



c/e(o)) = IcM fg(o){xY dx - c I fornix) d¥n{x) 



2c2 2c(^(o) 



3^(0) (^(0))2 



giving c = |(1 — Xn/0^^^). If Xn < 3x„, one could also choose to take 9^^^ = 3x„, so that the 
starting function f^^^ would be a density. 

The two main steps are minimizing D^ifo; f) as a function of 6 and minimizing the 
function over the space of piecewise linear functions with bend points in a finite set S*. 
For the first step, we follow the line of thought given in section |3] In this example we have 
for all e > that 

^{f + ef0) = (P{f) + ec^{e,f) + ^e^C2{6) with ci{e,f)=D^{fe;f) and C2{9) = ^ . 
Hence, we have as 'alternative directional derivative' function 

\/C2{9) 

where ~ denotes 'equality apart from a positive multiplicative constant'. Note that, since 
w{9) = \f9 is strictly positive and uniformly bounded away from zero and infinity on B = 
we are in the situation described below theorem 13.11 Note that 9 i— > D^{fg; f) is 
continuous, 

D^{fe-J) = at 9 = and lim D^{fg-J) = . 

Hence, if D^{fg; /) < for some 9, it attains its minimal value. 

The second step in the algorithm boils down to the following procedure. Write S* = 
Sf U {9} = {6*1, 92, ... , 9m} with 9i < ■ ■ ■ < 9m and construct a cubic spline P with knots 
{^^1, 92, ... , 9m} such that 

P{e) = Yn{9) for all 9 G S*, P(0) = P'(0) = P"{9m) = . (6.17) 
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Note that the second derivative of this cubic sphne minimizes the function within the class 
of hnear sphnes / with knots concentrated on the set {6*1, 6*2, . . . , ^m} subject to the boundary 
constraint that 1(6^) = 0. This follows by setting the derivatives of in the directions fg. 
equal to zero, i.e. Dtp{fo.; f) = 0. 

Figure 1 shows the results of the SR algorithm based on a sample of size 500 from the 
standard exponential distribution. The solution on an equidistant grid in [0, 3x(„)] = [0, 16.5] 
consisting of 1000 points was obtained after 33 iterations. Furthermore, we used accuracy 
parameter rj = 10~^°. 



(a) (b) 




02468 02468 



Figure 1: (a) LS estimate of the mixing distribution with the true Gamma (3) mixing 
distribution; (b) LS estimate of the mixture density with the true density; (c) the (alternative) 
directional derivative function evaluated at the LS estimate and (d) LS estimate of the mixture 
distribution with the empirical distribution function of the data. All pictures are based on a 
sample of size n = 500 from the standard exponential distribution. 



7 ML estimation in Gaussian deconvolution 

In order to apply the support reduction algorithm of section El the setting of Example 2 is 
not appropriate since the minimization there has to be performed over the convex hull of 
the functions fg instead of the convex cone generated by them. Contrary to the situation of 
section ini the minimizer of over the cone does not exist (given a function / with </>(/) < 0, 
the function applied to c ■ / for c > tends to minus infinity). To get a well posed 
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optimization problem over the convex cone so that its solution is the minimizer of </> over the 
convex hull of JF, we have to relax the constraint that the solution has to be a probability 
density. The new objective function then becomes 



= -J logf{x)d¥n{x) + J f{x)dx. 

In principle, the support reduction algorithm can be applied directly to the thus obtained 
optimization problem. However, we observed that a Newton-type procedure (based on the 
support reduction algorithm) worked significantly better than the direct application of the 
support reduction algorithm. We describe this Newton procedure here. Write / for the 

current iterate. 
Note that 

0(/) - 0(/) = - 1 log (^1 + dFnix) + J fix) - fix) dx 

For (/ — /)// small, we get the following quadratic approximation of (p at /, using the second 
order Taylor approximation of the logarithm at 1 



/(x)-/(x)V fix) -fix) 



d¥nix)+ / fix)- fix) dx 



2 V fix) J fix) 
Ignoring terms that do not depend on /, we define the following local objective function 



1 ffix)V Ji^ 



Mf) = Mf-^ f) = J /(^) dx + J 2 " ^7(4 '^^"^''^ 

This quadratic function can be minimized over the (finitely generated) cone using the support 
reduction algorithm, yielding 

fg = argmin{(/)g(/; f):fe cone(/0 : 6 G 65)} 

The next iterate is then obtained as / + A(/g — /) (A chosen appropriately to assure 
monotonicity of the algorithm). 

This method is used to solve the (finite dimensional) optimization problem over the cone 
of functions generated by {fe : 6 G 0,5}. After this, the fine tuning in support points 
(leaving the prespecified grid) is performed as described in section 

During the Newton iterations to obtain the solution to the finite dimensional problem as 
well as in the fine tuning step following it, quadratic optimization problems of the type find 

argmin{0g(/) : / G cone(/e : 9 E S] 

are solved for some finite set S. Starting from an initial value, say g (the natural candidate 
for this will be obvious from the context; usually it has only a few active vertices), the 
support reduction algorithm consists of two steps that are iterated: 

1) Find new support point 
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2) Do finite dimensional constrained optimization using iterative unconstrained 
minimizations. 



Step 1. In the notation of section HI we liave 

ci{e,g) = j fe{x)dx~2 j k(x)dF^{x) + J ik{x)d¥n{x) and C2(0) = j ^(x)rfF„(x). 
Hence, tlie new vertex is given by 



9 = argmiugge, 



Step 2. During tliis step, given a support set {9i, . . . ,9p}, we sliould find a subset S 
of {9i, . . . ,9p} witli associated optimal / such that / minimizes (pq over the linear space 
generated by the functions {fg : 9^3} and, moreover, has only scalars aj > in the 
representation 

f=Yl A 

The basic step in finding S and / is to minimize, without restrictions on aj, the quadratic 
function 

ij{ai, . . . = </)g( ttj/e,) 

= X: (/ (x) rfx - 2 I c^F.(x)) + ^ E i: / ^^^^^^ dW^x) 

= a^u + -a^Va 

Define the n x p-matrix Y by Yij = fg .{xi). Note that this matrix only depends on the values 
of the current vertices at the observed sample. Also define the n-vector dhj di = {fixi))~^ 
and the n x n diagonal matrix D Da = di. Then nV = Y^D^DY and nu = Up — 2Y^d 
(using that the vertices are in fact probability densities, denoting by Hp the p-vector with all 
elements equal to n) and the optimal a G IR^ minimizing ip is the solution to the following 
linear system of equations 

{DYfDYa = 2Y^d - Hp 

If the matrix DY has full rank p, this system has a unique solution. 

Figure 2 shows the results of the SR algorithm based on a simulated dataset of size 
n = 500 where the mixing distribution is standard exponential. First it took 25 iterations 
to obtain the solution on an equidistant grid of size 500 in = [—2.47,7.96]. This 

grid-solution had eight support points. After that, 1085 steps of the fine tuning step of 
section El were taken, resulting in an estimate of the mixing distribution with five support 
points. 

Acknowledgement: We thank Jon Wakefield for drawing our attention to Mallet's paper. 
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(a) (b) 




Figure 2: (a) ML estimate of the mixing distribution with the true mixing distribution; (b) 
ML estimate of the mixture density with the true density; (c) the (alternative) directional 
derivative function evaluated at the ML estimate and (d) ML estimate of the mixture 
distribution with the empirical distribution function of the data. All pictures are based on a 
sample of size n — 500 from the standard exponential miocture of standard normals. 
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